- 
                Notifications
    You must be signed in to change notification settings 
- Fork 25.6k
[ES|QL] Consider min/max from predicates when transform date_trunc/bucket to round_to #131341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ES|QL] Consider min/max from predicates when transform date_trunc/bucket to round_to #131341
Conversation
| Hi @fang-xing-esql, I've created a changelog YAML for you. | 
|  | ||
| private List<EsqlBinaryComparison> predicates(Eval eval, Expression field) { | ||
| List<EsqlBinaryComparison> binaryComparisons = new ArrayList<>(); | ||
| eval.forEachUp(Filter.class, filter -> { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This scans up from the root of the query, right? Is there a way to put something between the eval and the filter that makes the scan invalid? I know if we weren't dealing with NameId you could rename stuff. But that shouldn't be possible now. But is there anything that is?
Like, maybe you can filter on timestamp and then change it. Or is that never possible because of NameId too?
I think this is fine actually. But I don't know enough!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I’m going to add some more queries to the functional tests, there will definitely be cases, like the eval, rename commands, that can complicate things. I implemented the rewrite on the data node rather than the coordinator node to avoid transport version changes. My thinking was that the coordinator node might be able to simplify some of the more complex scenarios before the plan reaches the data node.
I haven’t specifically handled NameIds yet, but I’ll keep an eye on them as I test with more complex queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some more queries with Eval, Rename that change the field reference by DateTrunc and Bucket. We have a couple of rules that are smart enough to find the real field, even though they are changed by Eval or Rename earlier.
        
          
                ...sticsearch/xpack/esql/optimizer/rules/logical/local/LocalSubstituteSurrogateExpressions.java
              
                Outdated
          
            Show resolved
            Hide resolved
        
      | Pinging @elastic/es-analytical-engine (Team:Analytics) | 
| @astefan @bpintea Could you help review the changes to  I added some new tests covering various kinds of predicates on the field, including overlaps or non-overlaps with Lucene statistics, and also some queries involving  I'd really appreciate your thoughts, if you can think of any other way to break/test it please let me know, thanks in advance! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The way the LocalSubstituteSurrogateExpressions looks like in main is a generic rule that deals with local surrogate expressions of any kind and applies the surrogate logic in functions found in eval.
This PR though, leaks the logic of specific surrogate expressions (like the ones of DateTrunc and Bucket) in LocalSubstituteSurrogateExpressions like finding all the EsqlBinaryComparisons, checking if they act on a foldable value etc.
Maybe these two functions (bucket and date_trunc) should not use the surrogate expressions approach to have them changed to round_to depending on things that are not only the ones from SearchStats.
| Holder<Boolean> foundMaxValue = new Holder<>(false); | ||
| for (EsqlBinaryComparison binaryComparison : binaryComparisons) { | ||
| if (binaryComparison.right() instanceof Literal l) { | ||
| long value = Long.parseLong(l.value().toString()); | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we doing here the parsing of a long from String? The dataType of the Literal is not enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, I'll refactor this.
| Long minFromPredicates = minMaxFromPredicates.v1(); | ||
| Long maxFromPredicates = minMaxFromPredicates.v2(); | ||
| // Consolidate min/max from SearchStats and query | ||
| if (minFromPredicates instanceof Long minValue) { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This value here should already be a Long. Why the instanceof?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need a check for null, not instanceof here, I'll change it.
| 
 That is a good point. Actually at the beginning, I was debating between two implementations - a generalized transformation leveraging the surrogate expression mechanism, or a dedicated rewrite rule specifically targeting  This PR is the first approach, and #132143 is the second. They both can achieve the goal, I'm open to either way. It is true that the predicates analysis can happen on the coordinator node, the main reason that I choose to evaluate them together with  | 
| Close this as we decide to take this approach - #132143 | 
This is a follow up enhancement to #128639, aimed at improving the
date_histogramrelatedES|QLperformance regression tests.The
nyc_taxisdataset is used in thedate_histogramanddate_truncrelatedES|QLperformance regression tests. It contains a wide date range, from the year 1900 to 2253. If relying only on Lucene statistics to calculate the fixed rounding points for thedropoff_datetimefield, the fixed rounding points could not be calculated due to this wide range. As a result, thedate_truncfunction in the existing performance regression tests is not rewritten to round_to, even though the query below only processes two months of data.This PR takes the query predicates into account and calculates the fixed rounding points based on the intersection of the date range from SearchStats (1900–2253) and the query's filter range (2015-01-01 to 2025-03-01). By using this narrowed intersection (2015-01-01 to 2025-03-01), the
date_truncfunction can be safely optimized toround_to.Performance Observations
The three queries below are measured against main and this branch, with different calendar intervals and various data volume within each date range. The observation is that different
RoundToevaluators perform differently comparing withDateTruncDatetimeEvaluator. TheThe elapsed time below are the average of 10 runs of each query on main and branch
Looking more closely at Q1, where a slight performance degradation was observed on the 3-month dataset using 9 weekly buckets with
RoundToLongLinearSearchEvaluator, I noticed that the query behaves differently depending on the evaluator used. Among the three RoundTo evaluators, the manualRoundToLong9Evaluatorperforms the best, and it also outperformsDateTruncDatetimeEvaluator.There are a couple of follow-up tasks for the performance-related work:
RoundToLongLinearSearchEvaluatorwith manual evaluators, as experiments have shown that the manual versions perform better. [ES|QL] Replace RoundTo linear search evaluator with manual evaluators #131733